Fetching Text from a Website

Develop the web-scraping and data-saving program.

These are the libraries that we will be using.

import requests
from bs4 import BeautifulSoup
import openpyxl

Getting text

The first function you want to create will accept a website address (URL) as an argument and return the text of the website (the HTML source as a str). This is a good, general-purpose function that can be reused in any other program you write because it is agnostic to the URL.

Headers are what your browser sends along with its request to access a webpage. The user-agent identifies the browser and operating system making the request. Because requests without a user-agent are very obviously robots, it is good practice to include your normal user-agent to show that you mean no harm. The easiest way to find your browser’s user-agent is to type “What is my user agent?” into a search engine. Requests will still work if you just get() the URL, but that kind of bare request is the most likely to be blocked by a website since it is obviously coming from a bot; some websites auto-block such requests to prevent different types of attacks. Having said that, do not add a user-agent to do malicious things to a website!
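Putting the two ideas above together, the fetching function might look like the following sketch. The function name get_html and the user-agent string are placeholders, not from the lesson; substitute the user-agent your own browser reports.

```python
import requests

# Placeholder user-agent string -- replace it with the one your browser reports
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def get_html(url: str) -> str:
    """Return the HTML source of a webpage as a str."""
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()  # stop early on a 4xx/5xx status code
    return response.text
```

Because the function depends only on the URL it receives, it can be dropped into any scraping script unchanged.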

See what the response looks like for the Engineering Toolbox website:

Seeing a Response object's text

Parsing HTML

By itself, the requests.Response object contains the text of the website, but also the status code, any JSON, the request history, the encoding, URLs, etc. You just want the text, similar to what you used for plotting airfoils.

Next is the hard part: parsing the HTML. Constructing the “soup” is fairly simple:

soup = BeautifulSoup(response.text, features="lxml")

This line uses the BeautifulSoup library to convert all of the text into a navigable BeautifulSoup object. features is an optional argument to tell BeautifulSoup which HTML parser to use.
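As a small offline sketch of what “navigable” means, the snippet below parses an invented piece of HTML (not the real page) and walks its tags by name. The built-in "html.parser" is used here so nothing beyond bs4 is needed; the lesson itself passes features="lxml", which requires the lxml package.

```python
from bs4 import BeautifulSoup

# Invented snippet standing in for a real page's source
html = """
<html>
  <head><title>Material Data</title></head>
  <body><h1>Aluminum</h1><p>Density: 2.70 g/cm3</p></body>
</html>
"""

# "html.parser" ships with Python; the lesson's "lxml" needs an extra install
soup = BeautifulSoup(html, features="html.parser")

print(soup.title.text)      # Material Data
print(soup.find("p").text)  # Density: 2.70 g/cm3
```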

Now you can navigate through the various tags to get the exact HTML snippet that you need. Though the website looks good when it is rendered on a browser for you, the code that makes it look pretty is a little more like spaghetti. A key element of web scraping is to dig through HTML code to find the “HTML coordinates” of the information that you are looking for. This requires looking through the website’s source code to find HTML keywords or tags for the data that you are looking for.

Once you have identified a tag to pull, the best way to find it is the soup.find() method with a class_= argument. Notice that the class_ keyword has to have the trailing underscore because class by itself is a reserved Python keyword, used to define and create new data types, so it cannot be used as an argument name. The find() method returns a tag containing nested levels of child tags, so you will usually need a couple of nested for loops to get to the data you want.
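For example, with an invented two-tag snippet (not from the lesson's webpage), class_ picks out exactly the tag whose class attribute matches:

```python
from bs4 import BeautifulSoup

# Invented snippet: two <div> tags distinguished only by their class
html = '<div class="intro">hello</div><div class="data">42</div>'
soup = BeautifulSoup(html, features="html.parser")

# class_ needs the underscore because class is a reserved word in Python
tag = soup.find("div", class_="data")
print(tag.text)  # 42
```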

Your goal is to find all of the HTML that creates the Aluminum Properties table. The HTML on the right is what creates the webpage on the left. Specifically, what is highlighted on the right creates what is highlighted on the left.

Comparison of rendered webpage and HTML that builds the rendered webpage

In order to find this source code, view the page source for that website by right-clicking on the page and selecting “View Page Source.” Search for “Aluminum Alloy,” which is the header of the left-most column. The ninth instance of “Aluminum Alloy” is the one we want: <table class="large tablesorter"> <thead> <tr><th style="width: 16%;">Aluminum Alloy</th>. The table class can also be found by right-clicking inside the table on the webpage, clicking “Inspect Element,” and checking whether the table has a class or ID.

Finding a specific class name means you can call out exactly the table you want to scrape. You should be able to find the table with a class of "large tablesorter" in the soup. As a reminder, when you find() an HTML tag by its class property, the class keyword has to have an underscore after it, so you will pass class_="large tablesorter" to the soup.

What comes next is a lot of guessing and checking to see how you can isolate the column headers and the subsequent rows by their HTML properties. By doing this, you get the exact “HTML coordinates” that you can always use to know how to find the material properties. The function enumerate() is a good troubleshooting tool to print out the indices and cells to help see if anything stands out.

table = soup.find("table", class_="large tablesorter")
for row in table:                     # children of <table>: <thead>, <tbody>, ...
    for index, tr in enumerate(row):  # children of each section: the <tr> tags
        print(index, tr)

This code printed a long list of tags to the console:

Iterating through table rows

The second major line of output corresponds to this row in the aluminum properties table:

Comparing HTML to the rendered webpage
